Abstract

There exist fast Fourier transform (FFT) algorithms, called
dimensionless FFTs, that work independently of dimension.
These algorithms can be configured to compute DFTs of
different dimensions simply by relabeling the input data and
by changing the values of the twiddle factors occurring in the
butterfly operations. This observation allows us to design an
FFT processor which, with minor reconfiguration, can compute
one-, two-, and three-dimensional DFTs. In this paper we
design a family of FFT processors, parameterized by the
number of points, the dimension, the number of processors,
and the internal dataflow, and show how to map different
dimensionless FFTs onto this hardware design. Different
dimensionless FFTs have different dataflows and consequently
lead to different performance characteristics. Using a
performance model, we search for the optimal algorithm for
the family of processors we considered. The resulting
algorithm and the corresponding hardware design were
implemented on an FPGA.

1  Introduction 

In many applications, the fast Fourier transform (FFT) is a
computationally intensive task because of the amount of data
to be processed. The amount of data (i.e., the problem size)
depends on the number of points and the dimension of the
transform. Engineers and scientists therefore rely on
approaches such as highly tuned code for uniprocessors, DSP
processors, ASICs, IP cores, and reconfigurable architectures
to meet performance requirements subject to other design
constraints such as physical space. A list of references to
these approaches is provided in [1]. Our study, which is part
of the SPIRAL project [?], focuses on using mathematical
properties of the FFT to help design high-performance
hardware.

The novelty of our work is threefold. First, we base our
processor on the dimensionless FFT [2], which allows a single
hardware design to compute one-, two-, and three-dimensional
DFTs. Second, we provide a framework for systematically
mapping alternative FFT algorithms onto parameterized
hardware designs. This is achieved by mapping a mathematical
description of the FFT, based on matrix factorizations [3], to
hardware that implements flow control and generation of the
necessary roots of unity (twiddle factors). Finally, there are
many different FFT algorithms, each with a different dataflow
and consequently different performance characteristics. Thus
our hardware design becomes an optimization problem over the
space of possible FFT dataflows [4].

We thus propose a universal FFT engine that is parameterized
in terms of the number of points and the dimension of the
transform, and the choice of the algorithm. We consider a
distributed architecture composed of processing units (with
local memory) connected via an interconnection network. We
derived a class of optimal FFT dataflow diagrams based on a
memory-access cost function. The derivation followed a design
flow that uses a performance model and its simulation to
evaluate the choice of algorithm prior to design of the
hardware [5]. The proof-of-concept implementation of the
engine on the Wildforce™ [6] reconfigurable (FPGA) board is
performed in two steps: simulation of a hardware description
language model of the board, and actual execution on the
configured board. We have validated the configured board and
are in the process of benchmarking its performance. Future
implementations may use ASIC technology for the
floating-point (complex number) arithmetic cores and FPGA
technology for the parameterized flow control units.

In Section 2 we review the dimensionless FFT and the space of
FFT algorithms that we will consider. In Section 3 we describe
a family of FFT processors and the mapping of the algorithms
in Section 2 to this architecture. In Section 4 we introduce a
performance model for the architecture in Section 3 and find
the optimal algorithm with respect to this model. In Section 5
we describe the implementation of the design selected in
Section 4. Details not provided in this paper can be found in
[1].


2  Dimensionless  FFT  Algorithms 

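Let $X(a_1, \ldots, a_t)$ be a function of $t$ variables, where
$0 \le a_i < n_i$. The $t$-dimensional $n_1 \times \cdots \times n_t$
DFT of $X$ is

$$\hat{X}(b_1, \ldots, b_t) = \sum_{0 \le a_i < n_i} e^{-\frac{2\pi i}{n_1} a_1 b_1} \cdots e^{-\frac{2\pi i}{n_t} a_t b_t} \, X(a_1, \ldots, a_t).$$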

The multidimensional DFT can be interpreted as a
matrix-vector product. Let $x$ and $\hat{x}$ be the vectors of
size $N = n_1 \cdots n_t$ obtained by ordering the elements of
$X$ and $\hat{X}$ lexicographically. Then
$\hat{x} = (F_{n_1} \otimes \cdots \otimes F_{n_t})\, x$, where
$\otimes$ denotes the tensor (Kronecker) product and $F_{n_i}$ is
the $n_i$-point discrete Fourier matrix [3]. FFT algorithms can
be represented by factorizations of this matrix [9, 3].
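
The matrix-vector interpretation can be checked numerically. The following is a minimal sketch, assuming NumPy and its negative-exponent FFT convention; the helper name `dft_matrix` is ours and not part of the paper. It compares the Kronecker-product form with a library 2-D FFT for $n_1 = n_2 = 4$.

```python
import numpy as np

def dft_matrix(n):
    """n-point discrete Fourier matrix with entries exp(-2*pi*i*a*b/n)."""
    a = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(a, a) / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Lexicographic ordering of X is row-major flattening: index 4*a1 + a2.
x = X.reshape(-1)
x_hat = np.kron(dft_matrix(4), dft_matrix(4)) @ x

# The same vector is obtained by flattening a 2-D FFT of X in the same order.
kron_matches_fft2 = np.allclose(x_hat, np.fft.fft2(X).reshape(-1))
print(kron_matches_fft2)
```

The dense Kronecker product is used here only for verification; it costs $O(N^2)$ operations, which is exactly what the factorizations discussed next avoid.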

A dimensionless FFT [2] can compute any multidimensional DFT
where $n_1 \cdots n_t = N$, for a fixed $N$. For example, a
Fourier transform on 16 points can have dimension equal to
1 ($F_{16}$), 2 ($F_2 \otimes F_8$, $F_4 \otimes F_4$, and
$F_8 \otimes F_2$), 3 ($F_2 \otimes F_2 \otimes F_4$,
$F_2 \otimes F_4 \otimes F_2$, and $F_4 \otimes F_2 \otimes F_2$),
or 4 ($F_2 \otimes F_2 \otimes F_2 \otimes F_2$). Independent of
dimension, these matrices can be factored into

$$P_4 (I_8 \otimes F_2) T_4 \, P_3 (I_8 \otimes F_2) T_3 \, P_2 (I_8 \otimes F_2) T_2 \, P_1 (I_8 \otimes F_2) T_1 \, P_0 P_d,$$

where $I_8$ is an identity matrix, the $P_i$ are permutation
matrices, and the $T_i$ are diagonal matrices. Only the twiddle
factors $T_i$ and the initial permutation $P_d$ change as the
dimension changes. The internal dataflow, given by the
permutations $P_i$, is fixed, and hence the factorization
provides a dimensionless FFT.

In particular, if we set $P_0 = I_{16}$ and
$P_1 = P_2 = P_3 = P_4 = L^{16}_2$, the stride permutation [9]
which gathers the elements of a vector of size 16 at stride 2,

$$\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ 0 & 2 & 4 & 6 & 8 & 10 & 12 & 14 & 1 & 3 & 5 & 7 & 9 & 11 & 13 & 15 \end{pmatrix},$$

the algorithm is called the dimensionless Pease algorithm
because, in the one-dimensional case, it corresponds to an
algorithm described by M. Pease in [8]. Table 1 shows $P_d$ and
the twiddle factors that configure the dimensionless Pease
algorithm to compute a 1-D and a 2-D FFT. The permutation
$R_{2^t}$ is the $t$-bit bit-reversal permutation [9].

The set of internal permutations defines the dataflow of the
FFT. Figure 1 shows the dataflow of the Pease algorithm (the
boxes indicate an $F_2$ with twiddle computation). Alternative
algorithms with different dataflows exist (see [4] for a
classification of possible dataflows). For
example,  the  permutations  P0  =  (L®®-^))  Pi  =  L^(L\® 
h),  Pi  =  (L\  0  I2)Lf,  P3  =  {L\  0  h)Lf(Ll  ®  I2), 
P4  =  L26(L2  0  I2),  define  an  algorithm  whose  dataflow 
is  shown  in  Figure  2. 

In the following sections we will see that different
dataflows lead to different performance characteristics. The
set of possible dataflows consists of those sequences of
permutations that can be configured to compute the FFT, that
is, those dataflows for which there exist an initial
permutation and a sequence of twiddle factors such that the
resulting matrix factorization is equal to any compatible
multidimensional DFT. To optimize our design, we search over
the space of allowable dataflows for the one with the best
performance.


Table 1: Configuration of the dimensionless Pease algorithm.


To configure for (1-D) $F_{16}$, set $P_d = R_{16}$, $\omega = e^{-2\pi i/16}$,
$T_1 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)$
$T_2 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, \omega^4, 1, \omega^4, 1, \omega^4, 1, \omega^4)$
$T_3 = \mathrm{diag}(1, 1, 1, 1, 1, \omega^2, 1, \omega^2, 1, \omega^4, 1, \omega^4, 1, \omega^6, 1, \omega^6)$
$T_4 = \mathrm{diag}(1, 1, 1, \omega, 1, \omega^2, 1, \omega^3, 1, \omega^4, 1, \omega^5, 1, \omega^6, 1, \omega^7)$

To configure for (2-D) $F_4 \otimes F_4$, set $P_d = R_4 \otimes R_4$, $\omega = e^{-2\pi i/16}$,
$T_1 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)$
$T_2 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, \omega^4, 1, \omega^4, 1, \omega^4, 1, \omega^4)$
$T_3 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)$
$T_4 = \mathrm{diag}(1, 1, 1, 1, 1, 1, 1, 1, 1, \omega^4, 1, \omega^4, 1, \omega^4, 1, \omega^4)$

Figure 1: Dataflow diagram for dimensionless Pease algorithm

Figure 2: Dataflow diagram for alternative FFT algorithm
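
To make the reconfiguration concrete, the following is a minimal software sketch of the dimensionless Pease algorithm, assuming NumPy and the convention $\omega = e^{-2\pi i/N}$. The function names are ours, and the closed form used for the twiddle exponents is our reading of the pattern in Table 1 (it is an assumption for sizes other than 16). The dataflow is identical in every configuration: only the initial permutation and the twiddle exponents depend on the dimension.

```python
import numpy as np

def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def initial_permutation(dim_bits):
    """Index array for P_d = R_{n_1} (x) ... (x) R_{n_t}, n_i = 2**dim_bits[i].

    The last dimension occupies the low address bits (lexicographic order);
    the bit field of each dimension is reversed independently.
    """
    n = sum(dim_bits)
    indices = []
    for address in range(1 << n):
        permuted, shift = 0, 0
        for k in reversed(dim_bits):
            field = (address >> shift) & ((1 << k) - 1)
            permuted |= bit_reverse(field, k) << shift
            shift += k
        indices.append(permuted)
    return np.array(indices)

def dimensionless_pease(x, dim_bits):
    """Compute (F_{n_1} (x) ... (x) F_{n_t}) x with a fixed Pease dataflow."""
    n = sum(dim_bits)
    N = 1 << n
    omega = np.exp(-2j * np.pi / N)
    v = np.asarray(x, dtype=complex)[initial_permutation(dim_bits)]
    pair = np.arange(N // 2)                  # butterfly index 0 .. N/2 - 1
    for k in reversed(dim_bits):              # transform the last dimension first
        for s in range(1, k + 1):             # s-th stage within this dimension
            exponent = (pair >> (n - s)) << (n - s)
            a = v[0::2]
            b = v[1::2] * omega ** exponent   # diagonal twiddle matrix T
            w = np.empty_like(v)
            w[0::2], w[1::2] = a + b, a - b   # I (x) F_2 on consecutive pairs
            v = np.concatenate([w[0::2], w[1::2]])  # stride permutation L^N_2
    return v

rng = np.random.default_rng(1)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
ok_1d = np.allclose(dimensionless_pease(x, [4]), np.fft.fft(x))
ok_2d = np.allclose(dimensionless_pease(x, [2, 2]),
                    np.fft.fft2(x.reshape(4, 4)).reshape(-1))
print(ok_1d, ok_2d)
```

Passing `[4]` selects the 1-D configuration and `[2, 2]` the 2-D configuration of Table 1; the butterfly and stride-permutation lines are the same in both cases.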


3  Architectural  Framework 


In this section, we describe the architecture of our FFT
processor and illustrate how the algorithms from the previous
section can be mapped onto this architecture. The proposed
architecture, shown in Figure 3, contains three main units:
the interface, the interconnection network, and the processor
elements (PEs). The interface unit is used to transfer
parameters and data to and from the system. The reconfigurable
interconnection network provides the communication between the
PEs. Each PE has three main units: a memory (M), a computation
unit (CU), and an address generator (AG). This design is
similar to the Xputer proposed in [?].

When used as an FFT processor, the data is distributed among
the processor memories, the computation unit is used to
generate twiddle factors and perform butterfly operations, and
the address generators calculate the addresses of the two
inputs to each butterfly operation. In order to map an FFT
algorithm, represented as a matrix factorization, onto this
architecture, the address generators must be configured from
the permutations occurring in the factorization, and the
computation unit must be configured to compute the appropriate
twiddle factors. The initial permutation is incorporated into
the interface so that the data arrives in the appropriate
order. To simplify the address generators, we assume that the
FFT operates on a vector of size $N = 2^n$, that the number of
PEs is equal to $M = 2^m$, and that the permutations occurring
in the FFT are tensor permutations. A tensor permutation is a
permutation obtained by permuting the bits in the binary
representation of the addresses [9]. Bit reversal and stride
permutations are examples of tensor permutations.


Figure 3: The architecture

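
A tensor permutation can be illustrated with a few lines of code. This is a minimal sketch; the helper `permute_bits` and its argument convention (source bit positions listed from the most significant output bit down) are our own choices for illustration.

```python
def permute_bits(address, source_bits):
    """Build an address whose bits, most significant first, are copied from
    the listed bit positions of `address`."""
    result = 0
    for position in source_bits:
        result = (result << 1) | ((address >> position) & 1)
    return result

n = 4
addresses = range(1 << n)

# Stride permutation (gather at stride 2): rotate the bits left, (b2 b1 b0 b3).
stride_2 = [permute_bits(a, [2, 1, 0, 3]) for a in addresses]

# Bit reversal on 4 bits: (b0 b1 b2 b3).
bit_reversal = [permute_bits(a, [0, 1, 2, 3]) for a in addresses]

print(stride_2)
print(bit_reversal)
```

Because each permutation is just a rewiring of address bits, a hardware address generator can realize it with a binary counter and fixed or multiplexed bit routing rather than a lookup table.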

The $N$ elements of the input vector are distributed
consecutively in segments of size $N/M$ to the processor
memories. A memory location containing data is assigned a
global $n$-bit address $(b_{n-1} b_{n-2} \cdots b_0)$, where the
leading $m$ bits are the PE number and the trailing $n - m$ bits
are a local offset. To conserve memory, the computation is
performed in place. This requires a modification to the
factorization presented in the previous section, so that each
stage is of the form $P (I \otimes F_2) T P^{-1}$. The
conjugating permutation $P$ determines the addresses for each
butterfly operation. Since all permutations are assumed to be
tensor permutations, the permuted addresses can be generated
by permuting the bits of a binary counter.

For example, the 16-point Pease algorithm is transformed into
the four-stage factorization

$$L^{16}_2 (I_8 \otimes F_2) T_4 L^{16}_8 \cdot L^{16}_4 (I_8 \otimes F_2) T_3 L^{16}_4 \cdot L^{16}_8 (I_8 \otimes F_2) T_2 L^{16}_2 \cdot (I_8 \otimes F_2) T_1 P_d.$$

The addresses obtained for the 8 butterfly operations (two
consecutive data elements are used in a butterfly operation)
in each of the four stages are shown in Table 2. Each stage
shows the permuted address bits, and all addresses are
obtained by counting from 0 to 15.

Table 2: Butterfly addresses for the 16-point Pease algorithm

Stage 0        Stage 1        Stage 2        Stage 3
b3 b2 b1 b0    b2 b1 b0 b3    b1 b0 b3 b2    b0 b3 b2 b1
0000->P0       0000->P0       0000->P0       0000->P0
0001->P0       0010->P0       0100->P0       1000->P0
0010->P1       0100->P1       1000->P1       0001->P1
0011->P1       0110->P1       1100->P1       1001->P1
0100->P2       1000->P2       0001->P2       0010->P2
0101->P2       1010->P2       0101->P2       1010->P2
0110->P3       1100->P3       1001->P3       0011->P3
0111->P3       1110->P3       1101->P3       1011->P3
1000->P0       0001->P0       0010->P0       0100->P0
1001->P0       0011->P0       0110->P0       1100->P0
1010->P1       0101->P1       1010->P1       0101->P1
1011->P1       0111->P1       1110->P1       1101->P1
1100->P2       1001->P2       0011->P2       0110->P2
1101->P2       1011->P2       0111->P2       1110->P2
1110->P3       1101->P3       1011->P3       0111->P3
1111->P3       1111->P3       1111->P3       1111->P3
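
The address columns for the Pease algorithm follow from rotating the bits of a 4-bit counter, one extra position per stage. The sketch below, with helper names of our own choosing, generates the address sequence of each stage in butterfly order.

```python
def rotate_left(value, amount, bits):
    """Cyclically rotate the lowest `bits` bits of `value` left by `amount`."""
    amount %= bits
    mask = (1 << bits) - 1
    return ((value << amount) | (value >> (bits - amount))) & mask

def pease_stage_addresses(stage, n):
    """Addresses visited in `stage`, in butterfly order (two per butterfly)."""
    return [rotate_left(counter, stage, n) for counter in range(1 << n)]

for stage in range(4):
    column = pease_stage_addresses(stage, 4)
    print(f"Stage {stage}:", " ".join(format(a, "04b") for a in column))
```

Each stage is a permutation of the 16 addresses, so every data element is read and written exactly once per stage, which is what allows the computation to proceed in place.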

Butterfly operations are assigned to PEs using a round-robin
schedule. Assuming 4 PEs, Table 3 shows the address sequences
for each PE for the 16-point Pease algorithm. The twiddle
factors needed for a butterfly are determined in a manner
similar to address calculation.

Table 3: Address sequences for 16-point Pease algorithm

PE    Stage 0       Stage 1        Stage 2        Stage 3
0     0,1,8,9       0,2,1,3        0,4,2,6        0,8,4,12
1     2,3,10,11     4,6,5,7        8,12,10,14     1,9,5,13
2     4,5,12,13     8,10,9,11      1,5,3,7        2,10,6,14
3     6,7,14,15     12,14,13,15    9,13,11,15     3,11,7,15

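
The round-robin schedule can be sketched as follows. This is an illustration under the assumptions stated in the text (butterfly $k$ goes to PE $k \bmod M$, and stage $s$ addresses are the counter rotated left by $s$ bits); the function name `pease_pe_sequences` is ours.

```python
def pease_pe_sequences(n, m):
    """Per-stage, per-PE address sequences for the in-place Pease dataflow.

    n: log2 of the number of points; m: log2 of the number of PEs.
    Returns table[stage][pe] as a list of addresses.
    """
    size, num_pes, mask = 1 << n, 1 << m, (1 << n) - 1
    table = []
    for stage in range(n):
        sequences = [[] for _ in range(num_pes)]
        for butterfly in range(size // 2):
            pe = butterfly % num_pes                  # round-robin assignment
            for counter in (2 * butterfly, 2 * butterfly + 1):
                # Rotate the n-bit counter left by `stage` positions.
                address = ((counter << stage) | (counter >> (n - stage))) & mask
                sequences[pe].append(address)
        table.append(sequences)
    return table

table = pease_pe_sequences(4, 2)
for pe in range(4):
    print(pe, [table[stage][pe] for stage in range(4)])
```

Within a stage, the four PEs together touch each of the 16 addresses exactly once, which gives a quick consistency check on any hand-written schedule.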

4  Performance  Model 

In Section 2, it was shown that there are different FFT
algorithms with different dataflow patterns. Each algorithm
can be mapped onto the architectural framework outlined in
Section 3. It is not clear a priori which algorithm should be
used when selecting the ultimate design for the FFT processor.
Using the performance model, a search over a subset of
possible FFT algorithms was performed in order to select the
most efficient design. The algorithm found using this search
process substantially reduces the traffic over the
interconnection network when compared to the Pease algorithm.
The use of a performance model, instead of complete
simulation, allows us to explore many different possible
designs at an early stage of the design process. The use of
performance models early in the design process has been
promoted in [10].

The performance model was implemented using ADEPT [5], a
performance modeling tool based on Petri nets and implemented
in VHDL. In the performance model, data are represented by
tokens containing information that affects performance. The
flow of the tokens emulates the dataflow in the system. A
token in our system contains a pair of addresses corresponding
to a butterfly operation. The sequence of addresses is
generated and mapped to PEs by a scheduler. The PE, the memory
(M), and the interconnection models pass the tokens in a way
that imitates the computation steps, including memory read,
twiddle factor and butterfly operations, and memory write. The
operations and the memory accesses are emulated as delays,
while the address sequences dictate the flow of data. We use
the total simulation time (not the CPU time) reported by the
VHDL simulator as the performance cost.

The optimization problem is to find the FFT dataflow with
minimal running time. The set of dataflows considered is
obtained from the Pease algorithm by multiplying the
permutations by tensor permutations of the form
$P \otimes I_2$. Any such permutation leads to a valid
dataflow, and any valid dataflow can be obtained in this way.
Since an exhaustive search is prohibitive, the search was
limited to the case where $P$ is a stride permutation. The
search was performed with a model using four processors, with
the number of data points ranging from 16 to 1024. Figure 4
compares the performance of the optimal dataflow found with
that of the Pease algorithm.

The addressing for the optimal algorithm for 16 points is
given in Table 4. The addresses were generated by the sequence
of bit permutations $(b_2 b_1 b_3 b_0)$, $(b_2 b_1 b_0 b_3)$,
$(b_2 b_0 b_1 b_3)$, $(b_0 b_1 b_3 b_2)$. Comparing this sequence
of addresses to the Pease algorithm shows that the Pease
algorithm has 36 non-local memory accesses while the optimal
algorithm has only 16. There is a pattern in the optimal
dataflow found by the search, which can be parameterized by
the number of processors and data points. This dataflow has
several properties which simplify its implementation: (1) all
data access in the first $m$ stages is local, and (2) in the
remaining stages, half of the data accessed by a processor is
local and the other half is exchanged with one other
processor.


Figure 4: Comparison of the optimal and Pease algorithms

Table 4: Address sequences for optimal 16-point algorithm

PE    Stage 0        Stage 1        Stage 2        Stage 3
0     0,1,2,3        0,2,1,3        0,4,1,5        0,8,2,10
1     4,5,6,7        4,6,5,7        2,6,3,7        4,12,6,14
2     8,9,10,11      8,10,9,11      8,12,9,13      1,9,3,11
3     12,13,14,15    12,14,13,15    10,14,11,15    5,13,7,15
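
The non-local access counts of 36 and 16 can be reproduced with a short script. This is a sketch under the assumptions stated in Section 3 (element `a` is stored in PE `a >> (n - m)`, and butterfly `k` is executed by PE `k mod M`); the function names and the list encoding of the bit permutations are ours.

```python
def permute_bits(counter, source_bits):
    """Build an address whose bits, most significant first, are copied from
    the listed bit positions of `counter`."""
    result = 0
    for position in source_bits:
        result = (result << 1) | ((counter >> position) & 1)
    return result

def nonlocal_accesses(stage_bit_perms, n, m):
    """Count accesses to data held in the memory of a different PE."""
    num_pes = 1 << m
    count = 0
    for source_bits in stage_bit_perms:
        for counter in range(1 << n):
            executing_pe = (counter >> 1) % num_pes      # round-robin butterflies
            owning_pe = permute_bits(counter, source_bits) >> (n - m)
            count += executing_pe != owning_pe
    return count

# One bit permutation per stage, written as (bit3, bit2, bit1, bit0) sources.
pease = [[3, 2, 1, 0], [2, 1, 0, 3], [1, 0, 3, 2], [0, 3, 2, 1]]
optimal = [[2, 1, 3, 0], [2, 1, 0, 3], [2, 0, 1, 3], [0, 1, 3, 2]]

print(nonlocal_accesses(pease, 4, 2), nonlocal_accesses(optimal, 4, 2))
```

Counting per stage shows where the savings come from: for the optimal dataflow the first two stages contribute no non-local accesses, and each of the last two contributes 8.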

5  Implementation 

In this section, we describe the implementation of a universal
FFT processor on the Wildforce™ [6] FPGA board. The processor
design is based on the optimal dataflow found in the previous
section. The board consists of 5 Xilinx FPGA chips
(XC4085XLA), with local memories, connected via a configurable
crossbar interconnect. The board communicates with a host
using a PCI bus. The architecture in Section 3 is mapped to
the board as follows. One processor, CPE0, along with its FIFO,
is used as the interface unit. The remaining processors
implement four PEs, each with a computation unit and an
address generator. The crossbar is used for the
interconnection network.

Before the computation is performed, the processor is
configured using parameters containing size and dimension
information. The interface uses this information to perform
the initial permutation, $P_d$, and the remaining PEs use the
parameters to configure their address generators and
computation units.

After the board is configured, data is downloaded and
distributed to the memories of the PEs. Computation is
performed on single-precision complex data. The computation is
divided into two phases: "local" and "remote". During the
local phase, all data are in the local memory modules. During
each stage of the remote phase, half of the data is available
in local memory while the other half must be obtained over the
interconnect. Before each remote stage, the interconnect must
be configured according to the permutation at that stage.

The address generator in each PE generates a sequence of
addresses. The controller uses each address to read the data,
which is sent to the computation unit via a FIFO, while the
address itself is placed in a second FIFO for the write-back.
The computation unit includes the twiddle factor generator and
arithmetic units for performing butterfly operations. The
computation unit is pipelined so that an output is ready every
other clock cycle as long as inputs are fed continuously. The
implementation requires one pipelined floating-point adder and
one pipelined floating-point multiplier. The output from the
computation unit is written back to memory at the address
stored in the FIFO.

Special properties of the permutations in the optimal dataflow
allow address generation to use the adder shown in Figure 5.
The base address (INT) and increment (INC) are computed from
the stage number and the number of points (see [1]).


Figure  5:  Adder  used  for  generating  address  sequence 


6  Summary  and  Future  Work 

In this paper we have presented a family of algorithms, called
dimensionless FFTs, for computing multidimensional FFTs. We
also introduced a parameterized family of FFT processors and
showed how to map algorithms onto this hardware design.
Finally, using a performance model, we were able to find the
optimal FFT algorithm for the architecture we considered. A
prototype implementation of the optimal algorithm and the
resulting hardware design was carried out on an FPGA board,
and the resulting board configuration was validated. We are
currently benchmarking our implementation and comparing it to
other FFT processors. In the future we hope to extend this
methodology to a larger class of algorithms and processor
designs.